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Abstract 

This paper surveys recent work in applying ideas from graphical models and message passing 
algorithms to solve large scale regularized regression problems. In particular, the focus is on 
compressed sensing reconstruction via t\ penalized least-squares (known as LASSO or BPDN). 
We discuss how to derive fast approximate message passing algorithms to solve this problem. 
Surprisingly, the analysis of such algorithms allows to prove exact high-dimensional limit results 
for the LASSO risk. 

This paper will appear as a chapter in a book on 'Compressed Sensing' edited by Yonina Eldar 
and Gitta Kutyniok. 

1 Introduction 

The problem of reconstructing a high-dimensional vector x € W 1 from a collection of observations 
y E M. m arises in a number of contexts, ranging from statistical learning to signal processing. It is 
often assumed that the measurement process is approximately linear, i.e. that 

y = Ax + w, (1.1) 

where A S ]£ mxn i s a known measurement matrix, and w is a noise vector. 

The graphical models approach to such reconstruction problem postulates a joint probability 
distribution on (x, y) which takes, without loss of generality, the form 

p(dx, dy) = p(dy\x) p(dx) . (1.2) 

The conditional distribution p(dy\x) models the noise process, while the prior p(dx) encodes informa- 
tion on the vector x. In particular, within compressed sensing, it can describe its sparsity properties. 
Within a graphical models approach, either of these distributions (or both) factorizes according to a 
specific graph structure. The resulting posterior distribution p(dx\y) is used for inferring x given y. 

There are many reasons to be skeptical about the idea that the joint probability distribution 
p(dx, dy) can be determined, and used for reconstructing x. To name one such reason for skepticism, 
any finite sample will allow to determine the prior distribution of x, p(dx) only within limited 
accuracy. A reconstruction algorithm based on the posterior distribution p(dx\y) might be sensitive 
with respect to changes in the prior thus leading to systematic errors. 

One might be tempted to drop the whole approach as a consequence. We argue that sticking to 
this point of view is instead fruitful for several reasons: 
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1. Algorithmic. Several existing reconstruction methods are in fact M-estimators, i.e. they are de- 
fined by minimizing an appropriate cost function CA,y(%) over x € W 1 [vdVOO] . Such estimators 
can be derived as Bayesian estimators (e.g. maximum a posteriori probability) for specific forms 
of p(dx) and p(dy\x) (for instance by letting p(dx\y) oc exp{— CA,y(x)) dx ). The connection 
is useful both in interpreting/comparing different methods, and in adapting known algorithms 
for Bayes estimation. A classical example of this cross-fertilization is the paper |FN03| . This 
review discusses several other examples in that build on graphical models inference algorithms. 

2. Minimax. When the prior p(dx) or the noise distributions, and therefore the conditional 
distribution p(dy\x), 'exist' but are unknown, it is reasonable to assume that they belong to 
specific structure classes. By this term we refer generically to a class of probability distributions 
characterized by a specific property. For instance, within compressed sensing one often assumes 
that x has at most k non-zero entries. One can then take p(dx) to be a distribution supported on 
/c-sparse vectors x € M. n . If T n \. denotes the class of such distributions, the minimax approach 
strives to achieve the best uniform guarantee over T n k- In other words, the minimax estimator 
achieves the smallest expected error (e.g. mean square error) for the 'worst' distribution in 

It is a remarkable fact in statistical decision theory [LC98] (which follows from a generalization 
of Von Neumann minimax theorem) that the minimax estimator coincides with the Bayes 
estimator for a specific (worst case) prior p G J~ n ^. In one dimension considerable information 
is available about the worst case distribution and asymptotically optimal estimators (see Section 
[3]). The methods developed here allow to develop similar insights in high-dimension. 

3. Modeling. In some applications it is possible to construct fairly accurate models both of the 
prior distribution p(dx) and of the measurement process p(dy\x). This is the case for instance 
in some communications problems, whereby x is the signal produced by a transmitter (and 
generated uniformly at random according to a known codebook), and w is the noise produced 
by a well-defined physical process (e.g. thermal noise in the receiver circuitry). A discussion 
of some families of practically interesting priors p(dx) can be found in |Cev08j . 

Further, the question of modeling the prior in compressed sensing is discussed from the point of view 
of Bayesian theory in |JXC08| . 

The rest of this chapter is organized as follows. Section [2] describes a graphical model natu- 
rally associated to the compressed sensing reconstruction problem. Section [3] provides important 
background on the one-dimensional case. Section [5] describes a standard message passing algorithm 
-the min-sum algorithm- and how it can be simplified to solve the LASSO optimization problem. 
The algorithm is further simplified in Section [5] yielding the AMP algoritm. The analysis of this 
algorithm is outlined in Section [6j As a consequence of this analysis, it is possible to compute exact 
high-dimensional limits for the behavior of the LASSO estimator. Finally in Section [7] we discuss a 
few examples of how the approach developed here can be used to address reconstruction problems 
in which a richer structural information is available. 

1.1 Some useful notation 

Throughout this review, probability measures over the real line 1R or the euclidean space 1^ play 
a special role. It is therefore useful to be careful about the probability-theory notation. The less 
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careful reader who prefers to pass directly to the 'action' is invited to skip these remarks at a first 
reading. 

We will use the notation p or p(dx) to indicate probability measures (eventually with subscripts). 
Notice that, in the last form, the dx is only a reminder of which variable is distributed with measure 
p. (Of course one is tempted to think of dx as an infinitesimal interval but this intuition is accurate 
only if p admits a density.) 

A special measure (positive, but not normalized and hence not a probability measure) is the 
Lebesgue measure for which we reserve the special notation dx (something like //(dx) would be 
more consistent but, in our opinion, less readable). This convention is particularly convenient for 
expressing in formulae statements of the form 'p admits a density f with respect to Lebesgue measure 
dx, with f \ x i — y f{x) = exp(— x 2 /{2a))/ \J 2ira a Borel function', which we write simply 

p(dx) = e- x2 ' 2a dx . (1.3) 

V27ro 

It is well known that expectations are defined as integrals with respect to the probability measure 
which we denote as 

Mf}= I f(x)p(dx), (1.4) 

sometimes omitting the subscript p in E p and M in L. Unless specified otherwise, we do not assume 
such probability measures to have a density with respect to Lebesgue measure. The probability 
measure p is a set function defined on the Borel cr-algebra, see e.g. |Bil951 IWil91j . Hence it makes 
sense to write p((— 1,3]) (the probability of the interval (—1,3] under measure p) or p({0}) (the 
probability of the point 0). Equally valid would be expressions such as drc((— 1,3]) (the Lebesgue 
measure of (—1, 3]) or p(dx)((— 1, 3]) (the probability of the interval (—1, 3] under measure p) but we 
avoid them as somewhat clumsy. 

A (joint) probability measure over x € M. K and y 6 M. L will be denoted by p(dx, dy) (this is just 
a probability measure over M. K x W L = M. K+L ). The corresponding conditional probability measure 
of y given x is denoted by p(dx\y) (for a rigorous definition we refer to [Bil95|. IWil91] ). 

Finally, we will not make use of cumulative distribution functions -commonly called distribu- 
tion functions in probability theory- and instead use 'probability distribution' interchangeably with 
'probability measure'. 

Some fairly standard discrete mathematics notation will also be useful. The set of first K integers 
is to be denoted by [K] = {1, . . . , K}. Order of growth of various functions will be characterized by 
the standard 'big-0' notation. Recall in particular that, for M — > oo, one writes f(M) = 0{g{M)) 
if f{M) < Cg(M) for some finite constant C, f{M) = n(g(M)) if f(M) > g{M)/C and f(M) = 
&(g(M)) if g(M)/C < f(M) < Cg(M). Further f(M) = o(g(M)) if f(M)/g(M) 0. Analogous 
notations are used when the argument of / and g go to 0. 

2 The basic model and its graph structure 

Specifying the conditional distribution of y given x is equivalent to specifying the distribution of the 
noise vector w. In most of this chapter we shall take p(w) to be a Gaussian distribution of mean 
and variance /3 _1 I, whence 

/ (3 \n/2 ( P 2 1 
pp{dy\x) = (-) expj - -Ay- Ax\\ 2 ^dy. (2.1) 
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The simplest choice for the prior consists in taking p{dx) to be a product distribution with identical 
factors p(dx) = p(dx\) x • • • x p{dx n ). We thus obtain the joint distribution 

Pf}(dx, dy) = (— ) expj - -||y- Ac|||}dy JJ^drCj). (2.2) 

i=l 

It is clear at the outset that generalizations of this basic model can be easily defined, in such a way 
to incorporate further information on the vector x or on the measurement process. As an example, 
consider the case of block-sparse signals: The index set [n] is partitioned into blocks B(l), -8(2), 
. . . B(£) of equal length n/£, and only a small fraction of the blocks is non- vanishing. This situation 
can be captured by assuming that the prior p(dx) factors over blocks. One thus obtains the joint 
distribution 

/2 ^ 

Pp(dx, dy) = (— ) ex p{ - ollf -^llijdy Y[p(dx B ^) , (2.3) 

where x^U) = (xi : i £ B(j)) G W 1 ^. Other examples of structured priors will be discussed in 
Section [7J 

The posterior distribution of x given observations y admits an explicit expression, that can be 
derived from Eq. f|2 . 2|> : 

1/3 n 
Pf){$x\ y) = -^y exp { - - Axg} IJp(da*) , (2.4) 

where Z(y) = (2vr//3) n / 2 p(y) ensures the normalization J p(dx\y) = 1. Let us stress that while this 
expression is explicit, computing expectations or marginals of this distribution is a hard computa- 
tional task. 

Finally, the square residuals \\y — Ax\\2 decompose in a sum of m terms yielding 

1 m 8 2 n 

P/3(dx\ y) = — — Yl exp | - -{y a - A^x) 2 } ~[[p(dxi) , (2.5) 

a=l i=l 

where A a is the a-th row of the matrix a. This factorized structure is conveniently described by a 
factor graph, i.e. a bipartite graph including a 'variable node' i G [n] for each variable Xi, and a 
'factor node' a € [m] for each term il) a { x ) — exp{— 8(y a — A^x) 2 /2}. Variable i and factor a are 
connected by an edge if and only if ijj a (x) depends non-trivially on x^, i.e. if A a i ^ 0. One such 
factor graph is reproduced in Fig. [TJ 

An estimate of the signal can be extracted from the posterior distribution (|2.5p in various ways. 
One possibility is to use conditional expectation 



x/3{y,p)= xpp(dx\y). (2.6) 

Classically, this estimator is justified by the fact that it achieves the minimal mean square provided 
the pp(dx,dy) is the actual joint distribution of (x,y). In the present context we will not assume 
that 'postulated' prior pp(dx) coincides with the actual distribution f x, and hence xp(y;p) is not 
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Figure 1: Factor graph associated to the probability distribution (|2.5[) . Empty circles correspond to variables 
Xi, i € [n] and squares correspond to measurements y a , a £ [m]. 



necessarily optimal (with respect to mean square error). The best justification for xp(y;p) is that a 
broad class of estimators can be written in the form (|2.6p . 

An important problem with the estimator (|2.6p is that it is in general hard to compute. In order to 
obtain a tractable proxy, we assume that p(dxi) = pp t h{dxi) = c fp t h(xi) dxi for fp t h{xi) = e~^ Xi ' 
an un-normalized probability density function. As j3 get large, the integral in Eq. (|2.6j) becomes 
dominated by the vector x with the highest posterior probability pp. One can then replace the 
integral in dx with a maximization over x and define 

x(y; h) = argmin z6R „CA,j/(z; h) , (2.7) 

1 n 
C A , v {z\K) = -\\y - Az\\l + ^2h(zi) , 

i=i 

where we assumed for simplicity that C A>y {z; h) has a unique minimum. 

According to the above discussion, the estimator x(y; h) can be thought of as the /3 — > oo limit 
of the general estimator (I2.6p . Indeed, it is easy to check that, provided Xj i— > h{x{) is upper 
semicontinuous, we have 

lim xp(y;pp th ) = x{y;h) . 

In other words, the posterior mean converges to the mode of the posterior in this limit. Further, 
x(y; h) takes the familiar form of a regression estimator with separable regularization. If h( ■ ) is 
convex, the computation of x is tractable. Important special cases include h(xi) = Xxf, which 
corresponds to ridge regression |HTF03| . and h(xi) = X\xi\ which corresponds to the LASSO |Tib96| 
or basis pursuit denoising (BPDN) [CD95]. Due to the special role it plays in compressed sensing, we 
will devote special attention to the latter case, that we rewrite explicitly below with a slight abuse 
of notation 

x(y) = &Tgmm zm nC A ,y(z) , (2.8) 
Ca, v (z) = 2 \\v ~ Az \\l + A ll z lli ■ 
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3 Revisiting the scalar case 



Before proceeding further, it is convenient to pause for a moment and consider the special case of a 
single measurement of a scalar quantity, i.e. the case m = n = 1. We therefore have 

y = X + W, (3.1) 

and want to estimate x from y. Despite the apparent simplicity, there exists a copious literature 
on this problem with many open problems [DJHS92| IDJ94bl IDJ94al IJoh02] . Here we only want to 
clarify a few points that will come up again in what follows. 

In order to compare various estimators we will assume that (x, y) are indeed random variables 
with some underlying probability distribution po{dx, dy) = po(dx)po(dy\x). It is important to stress 
that this distribution is conceptually distinct from the one used in inference, cf. Eq. (|2.6p . In 
particular we cannot assume to know the actual prior distribution of x, at least not exactly, and 
hence p(dx) and po(dx) do not coincide. The 'actual' prior po is the distribution of the vector to be 
inferred, while the 'postulated' prior p is a device used for designing inference algorithms. 

For the sake of simplicity we also consider Gaussian noise w ~ N(0, a 2 ) with known noise level 
a 2 . Various estimators will be compared with respect to the resulting mean square error 

MSE = E{\x(y) - x\ 2 } = / \x{y) - x\ 2 p (dx, dy) . 

We can distinguish two cases: 

I. The signal distribution po(x) is known as well. This can be regarded as an 'oracle' setting. To 
make contact with compressed sensing, we consider distributions that generate sparse signals, 
i.e. that put mass at least 1 — e on x = 0. In formulae po({0}) > 1 — e. 

II. The signal distribution is unknown but it is known that it is 'sparse', namely that it belongs 
to the class 

Fe = {po : Po({0}) > 1-e}. (3.2) 
The minimum mean square error, is the minimum MSE achievable by any estimator x : R — > R: 

MMSE(fj 2 ;p ) = inf E{\x(y) - x\ 2 } . 
It is well known that the infimum is achieved by the conditional expectation 

x MMSE (y) = [ x Po (dx\y). 

However, this estimator assumes that we are in situation I above, i.e. that the prior po is known. 
In Figure [2] we plot the resulting MSE for a 3 point distribution, 

Po = | hi + (1 - e) S + | S-x . (3.3) 

The MMSE is non-decreasing in a 2 by construction, converges to in the noiseless limit a — > 
(indeed the simple rule x(y) = y achieves MSE equal to a 2 ) and to e in the large noise limit a — > oo 
(MSE equal to e is achieved by x = 0). 
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Figure 2: Mean square error for estimating a three points random variable, with probability of non-zero e = 
0.1, in Gaussian noise. Red line: Minimal mean square error achieved by conditional expectation (thick) and 
its large noise asymptote (thin). Blue line: Mean square error for LASSO or cquivalently for soft thresholding 
(thick) and its small noise asymptote (thin). 
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In the more realistic situation II, we do not know the prior pq. A principled way to deal with 
this ignorance would be to minimize the MSE for the worst case distribution in the class J- e , i.e. to 
replace the minimization in Eq. (|3,3p with the following minimax problem 

inf sup E{|x(y) - x\ 2 } . (3.4) 

A lot is known about this problem |D JHS92"| ID J94b] ID J94a[ IJoh02j . In particular general statistical 
decision theory |LC98l IJoh02j implies that the optimum estimator is just the posterior expectation 
for a specific worst case prior. Unfortunately, even a superficial discussion of this literature goes 
beyond the scope of the present review. 

Nevertheless, an interesting exercise (indeed not a trivial one) is to consider the LASSO estimator 
(|2.8p . which in this case reduces to 

x(y;X) = argmin a6R |-(y - zf + A|z|| . (3.5) 

Notice that this estimator is insensitive to the details of the prior pq. Instead of the full minimax 
problem (|3.4p . one can then simply optimize the MSE over A. 

The one-dimensional optimization problem (|3.5p admits an explicit solution in terms of the soft 
thresholding function r\ : R x R + — > K defined as follows 

y - 9 if y > 9, 

V (y,6) = { if-9<y<9, (3.6) 

y + 9 if y < -6. 
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The threshold value 9 has to be chosen equal to the regularization parameter A yielding the simple 
solution 



x(y;\) = r ] (y;9), for A = 9. (3.7) 

(We emphasize the identity of A and 9 in the scalar case, because it breaks down in the vector case.) 

How should the parameter 9 (or equivalently A) be fixed? The rule is conceptually simple: 9 
should minimize the maximal mean square error for the class T e . Remarkably this complex saddle 
point problem can be solved rather explicitly. The key remark is that the worst case distribution 
over the class T e can be identified and takes the form = (e/2)5 +oc + (1 — e)5q + (e/2)(5_ 00 
|DJ94bllDJ94allJoh02j . 

Let us outline how the solution follows from this key fact. First of all, it makes sense to scale A 
as the noise standard deviation, because the estimator is supposed to filter out the noise. We then 
let 9 = aa. In Fig. [2] we plot the resulting MSE when 9 = aa, with a ~ 1.1402. We denote the 
LASSO/soft thresholding mean square error by mse(a 2 ;po, a) when the noise variance is a 2 , x ~ po, 
and the regularization parameter is A = 9 = aa. The worst case mean square error is given by 
sup„ og jr. mse(<7 2 ;po> Since the class F £ is invariant by rescaling, this worst case MSE must be 
proportional to the only scale in the problem, i.e., a 2 . We get 

sup mse(a 2 ;po,a) = M(e,a)a 2 . (3.8) 

The function M can be computed explicitly by evaluating the mean square error on the worst case 
distribution p* |DJ94bl IDJ94aj IJoh02j . A straightforward calculation (see also [DMM091 Supple- 
mentary Information], and [DMMlOb] ) yields 

M(e,a) =£(l + a 2 ) + (l-e)[2(l + a 2 )<5>(-a) -2a 0(a)] (3.9) 

where <p(z) = e~ z2 ^ 2 /\/2tt is the Gaussian density and &(z) = 4>( u ) is the Gaussian distri- 
bution. It is also not hard to show that that M(e, a) is the slope of the soft thresholding MSE at 
a 2 = in a plot like the one in Fig. [2j 

Minimizing the above expression over a, we obtain the soft thresholding minimax risk, and the 
corresponding optimal threshold value 

M*(e) = min M(s,a) , a*(e) = arg min M(e,a) . (3.10) 



The functions M*(e) and a*(e) are plotted in Fig. [3j For comparison we also plot the analogous 
functions when the class T e is replaced by T e [a) = {po G T £ : J x 2 po(dx) < a 2 } of sparse random 
variables with bounded second moment. Of particular interest is the behavior of these curves in the 
very sparse limit e — > 0, 

Af#(e)=2eIog(l/e){l + o(l)}, a*(e) = V21og(l/e) {l + o(l)} . (3.11) 

Getting back to Fig. [21 the reader will notice that there is a significant gap between the minimal 
MSE and the MSE achieved by soft-thresholding. This is the price paid by using an estimator that is 
uniformly good over the class JF e instead of one that is tailored for the distribution po at hand. Figured] 
compares the two estimators for a = 0.3. One might wonder whether all this price has to be paid, 
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Figure 3: Left frame (red line): minimax mean square error under soft thresholding for estimation of e-sparsc 
random variable in Gaussian noise. Blue lines correspond to signals of bounded second moment (labels on 
the curves refer to the maximum allowed value of [J x 2 poi&x)} 1 / 2 ). Right frame (red line): Optimal threshold 
level for the same estimation problem. Blue lines again refer to the case of bounded second moment. 
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Figure 4: Red line: The MMSE estimator for the three-point distribution (|3.3[) with e = 0.1, when the noise 
has standard deviation a = 0.3. Blue line: the minimax soft threshold estimator for the same setting. The 
corresponding mean square errors are plotted in Fig. [2] 




9 



a 



b c 




i.e. whether we can reduce the gap by using a more complex function instead of the soft threshold 
rj(y;6). The answer is both yes and no. On one hand, there exist provably superior -in minimax 
sense- estimators over JF e . Such estimators are of course more complex than simple soft thresholding. 
On the other hand, better estimators have the same minimax risk M*{e) = (21og(l/e))- 1 (l + o(l)} 
in the very sparse limit, i.e. they improve only the o(l) term as e — > jDJ94bt IDJ94al IJoh02] . 

4 Inference via message passing 

The task of extending the theory of the previous section to the vector case (jl.ip might appear 
daunting. It turns out that such extension is instead possible in specific high-dimensional limits. The 
key step consists in introducing an appropriate message passing algorithm to solve the optimization 
problem (|2.8|) and then analyzing its behavior. 

4.1 The min-sum algorithm 

We start by considering the min-sum algorithm. Min-sum is a popular optimization algorithm 
for graph-structured cost functions (see for instance |Pea88l IJor981 IMM091 IMR07] and references 
therein). In order to introduce the algorithm, we consider a general cost function over x = (x\, . . . , x n ), 
that decomposes according to a factor graph as the one shown in Fig. [TJ 

C(x) = ^2c a {x da )+^2c i (x i ). (4.1) 

Here F is the set of m factor nodes (squares in Fig. [T]) and V is the set of n variable nodes (circles in 
the same figure). Further da is the set of neighbors of node a and xg a = (x{ : i G da). The min-sum 
algorithm is an iterative algorithm of the belief-propagation type. Its basic variables are messages: 
a message is associated to each directed edge in the underlying factor graph. In the present case, 
messages are functions on the optimization variables, and we will denote them as Jj-^ a (xi) (from 
variable to factor), J^^Xi) (from factor to variable), with t indicating the iteration number. Figure 
[5] describes the association of messages to directed edges in the factor graph. Messages are meaningful 
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up to an additive constant, and therefore we will use the special symbol = to denote identity up 
to an additive constant independent of the argument X{. At the i-th iteration they are updated as 
follows^] 

JtXlfri) = C ^)+ E JLM), (4-2) 

b£di\a 

Jl+ifa) = min {C a (x da ) + Y, J Ua( x j)}- ( 4 - 3 ) 

j£oa\i 

Eventually, the optimum is approximated by 

= argmin J- +1 (xi) , (4.4) 

J^\ Xi ) - (H{xi) + Y,JLil?i) ( 4 - 5 ) 

There exists a vast literature justifying the use of algorithms of this type, applying them on concrete 
problems, and developing modifications of the basic iteration with better properties |Pea88t IJor98t 
IMM091 IMR071 IWJ081 IKF09j . Here we limit ourselves to recalling that the iteration P~3j) 
can be regarded as a dynamic programming iteration that computes the minimum cost when the 
underlying graph is a tree. Its application to loopy graphs (i.e., graphs with closed loops) is not 
generally guaranteed to converge. 

At this point we notice that the LASSO cost function Eq. (|2,8p can be decomposed as in Eq. (|4.ip . 

C A ,y{x) ^l^FiVa-Alxf + X^v^. (4.6) 

The min-sum updates read 

JfXlfa) = + Y JLifa), (4-7) 



Jl^ixi) = min{\(y a -A T a x) 2 + Y J Ua( x i)} • ( U 



j&da\i 



4.2 Simplifying min-sum by quadratic approximation 

Unfortunately, an exact implementation of the min-sum iteration appears extremely difficult because 
it requires to keep track of 2mn messages, each being a function on the real axis. A possible approach 
consists in developing numerical approximations to the messages. This line of research was initiated 
in [SBBlOj . 

Here we will overview an alternative approach that consists in deriving analytical approximations 
[DMM09] IDMMlOa] DMMlOb] . Its advantage is that it leads to a remarkably simple algorithm, which 
will be discussed in the next section. In order to justify this algorithm we will first derive a simplified 
message passing algorithm, whose messages are simple real numbers (instead of functions), and then 
(in the next section) reduce the number of messages from 2mn to m + n. 

lr The reader will notice that for a dense matrix A, di = [n] and da = [m]. We will nevertheless stick to the more 
general notation, since it is somewhat more transparent. 
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Throughout the derivation we shall assume that the matrix A is normalized in such a way that 
its columns have zero mean and unit £2 norm. Explicitly, we have X^a=i Aai = and YUa=i A ai = 1- 
In fact it is only sufficient that these conditions are satisfied asymptotically for large system sizes. 
Since however we are only presenting a heuristic argument, we defer a precise formulation of this 
assumption until Section 16.21 We also assume that its entries have roughly the same magnitude 
0{\/ y/m). Finally, we assume that m scales linearly with n. These assumptions are verified by 
many examples of sensing matrices in compressed sensing, e.g. random matrices with i.i.d. entries 
or random Fourier sections. Modifications of the basic algorithm that cope with strong violations of 
these assumptions are discussed in [BM10 . 

It is easy to see by induction that the messages J*_ s>a (xj), J*_^(a;j) remain, for any t, convex 
functions, provided they are initialized as convex functions at t = 0. In order to simplify the min- 
sum equations, we will approximate them by quadratic functions. Our first step consists in noticing 
that, as a consequence of Eq. (|4.8j) . the function J*_^(:cj) depends on its argument only through the 
combination A a iX{. Since A a i <C 1, we can approximate this dependence through a Taylor expansion 
(without loss of generality setting J*_^ i (0) = 0) : 

3Ui(*i) = -aUMaiXi) + ^{^{AaiXif + 0{Al iX f) . (4.9) 

The reason for stopping this expansion at third order should become clear in a moment. Indeed 
substituting in Eq. (|4.7p we get 

J&ifo) = AM - ( E A u<*U) ^ + \( E A uPU^l + 0{nA\xj) . (4.10) 

badi\a b&di\a 

Since A a i = 0(l/y / n), the last term is negligible. At this point we want to approximate J\^ a by its 
second order Taylor expansion around its minimum. The reason for this is that only this order of 
the expansion matters when plugging these messages in Eq. f|4.8j> to compute a*^, Pa^-i- We thus 
define the quantities x\_^ a , 7*_s. a as parameters of this Taylor expansion: 

jU a (xi) = —J— (Xi ~ xUa? + 0({Xi ~ X*_J 3 ) . (4.11) 
I i— >a 

Here we include also the case in which the minimum of J^ a (xi) is achieved at Xi = (and hence 
the function is not differentiable at its minimum) by letting 7*_ > . = in that case. Comparing 
Eqs. (|4.10p and (|4.1ip . and recalling the definition of i]( ■ ; • ), cf. Eq. (|3.6p . we get 



x 



^ = t7(ai;a a ), 7^ = ^(ai;a 2 ), (4.12) 
where rf ( • ; • ) denotes the derivative of rj with respect to its first argument and we defined 

_ J2b&di\a Abiotic A , i o \ 

3l =V A 2 at ' a 2 = — . (4.13) 

2^b&di\a ^biPb-ti l~,b£di\a ^biPb-ti 

Finally, by plugging the parametrization (|4.1ip in Eq. (|4.8p and comparing with Eq. (|4.9p . we can 
compute the parameters c»4-«> Pa^-i- A l° n S but straightforward calculation yields 



t 



1 



£ A <V X Ua}> ( 4 - 14 ) 



+ 2^jdda\i ajlj-^a j £ da\i 

PUi = rr^ 1 ,2 - t • ( 4 - 15 ) 



_l_ y , A 2 ^ 

' l—i jada\i aj lj- 
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Equations (|4.12|) to (|4.15p define a message passing algorithm that is considerably simpler than the 
original min-sum algorithm: each message consists of a pair of real numbers, namely {x\^ a , 7*_ > . ) for 
variable-to-factor messages and (a a _>j, /3 a -n) for factor-to-variable messages. In the next section we 
will simplify it further and construct an algorithm (AMP) with several interesting properties. Let 
us pause a moment for making two observations: 

1. The soft-thresholding operator that played an important role in the scalar case, cf. Eq. (|3|), 
reappeared in Eq. (|4.12p . Notice however that the threshold value that follows as a consequence 
of our derivation is not the naive one, namely equal to the regularization parameter A, but rather 
a rescaled one. 

2. Our derivation leveraged on the assumption that the matrix entries A a i are all of the same 
order, namely 0(l/y/m). It would be interesting to repeat the above derivation under different 
assumptions on the sensing matrix. 

5 Approximate message passing 

The algorithm derived above is still complex in that its memory requirements scale proportionally to 
the product of the number of dimensions of the signal and of the number of measurements. Further, its 
computational complexity per iteration scales quadratically as well. In this section we will introduce 
a simpler algorithm, and subsequently discuss its derivation from the one in the previous section. 

5.1 The AMP algorithm, some of its properties, . . . 

The AMP (for approximate message passing) algorithm is parameterized by two sequences of scalars: 
the thresholds {9t}t>o and the 'reaction terms' {bj}t>o- Starting with initial condition x° = 0, it 
constructs a sequence of estimates x t £ R n , and residuals r* £ R m , according to the following 
iteration 



for all £ > (with convention r _1 = 0). Here and below, given a scalar function / : R — > R, and a 
vector u G M. e , we adopt the convention of denoting by f(u) the vector (f(ui), . . . , f(ui)). 

The choice of parameters {6t}t>o and {bt}t>o is tightly constrained by the connection with the 
min-sum algorithm, as it will be discussed below, but the connection with the LASSO is more general. 
Indeed, as formalized by the proposition below, general sequences {6t}t>o and {bt}t>o can be used 
as far as (x*, z*) converges. 

Proposition 5.1. Let (x*,r*) be a fixed point of the iteration h5. 1}) . \5.2§ for 9t = 0, bj = b fixed. 
Then x* is a minimum of the LASSO cost function K2. 8\) for 



x 



,t+i 



rj(x t + A T r f ;6 t ) 
y — Ax 1 + bt r* -1 



(5.1) 
(5.2) 



A = 0(l-b). 



(5.3) 



Proof. From Eq. (I5.ip we get the fixed point condition 



x* + 9v = x* + A T r* 



(5.4) 
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for v S M. n such that Vi = sign(x*) if x* ^ and Vi E [— 1,+1] otherwise. In other words, v is a 
subgradient of the £i-norm at x*, v € <9||x*||i. Further from Eq. f)5.2[) we get (1 — b)r* = y — Ax*. 
Substituting in the above equation, we get 

6(1 - b)v* = A T (y - Ax*) , 

which is just the stationarity condition for the LASSO cost function if A = 6(1 — b). □ 

As a consequence of this proposition, if we find sequences {6t}t>o, {bt}t>o that converge, and 
such that the estimates x* converge as well, then we are guaranteed that the limit is a LASSO 
optimum. The connection with the message passing min-sum algorithm (see Section 15. 2|) implies an 
unambiguous prescription for bt- 

b t = — \\x% , (5.5) 
m 

where ||«||o denotes the pseudo-norm of vector u, i.e. the number of its non-zero components. The 
choice of the sequence of thresholds {#t}t>o is somewhat more flexible. Recalling the discussion of 
the scalar case, it appears to be a good choice to use 6 t = ar t where a > and Tt is the root mean 
square error of the un-thresholded estimate (x* + A T r t ). It can be shown that the latter is (in an 
high-dimensional setting) well approximated by (||r* W^/fn) 1 / 2 . We thus obtain the prescription 

6 t = a%, T? = -||r*|H. (5.6) 

m 

Alternative estimators can be used instead of % as defined above. For instance, the median of 
{l r il}ie[-m]i can De used to define the alternative estimator: 

? ' = $-1(3/4) |r<1 ^' (5 - 7) 

where \u\^ is the ^-th largest magnitude among the entries of a vector u, and <3?~ 1 (3/4) ~ 0.6745 
denotes the median of the absolute values of a Gaussian random variable. 

By Proposition 15. 1\ if the iteration converges to (x, r), then this is a minimum of the LASSO cost 
function, with regularization parameter 

A = aWl-&l (5.8) 



m \ m 

(in case the threshold is chosen as per Eq. (j5.6fl ). While the relation between a and A is not fully 
explicit (it requires to find the optimum x), in practice a is as useful as A: both play the role of 
knobs that adjust the level of sparsity of the seeked solution. 

We conclude by noting that the AMP algorithm (|5.ip . (|5.2p is quite close to iterative soft thresh- 
olding (1ST), a well known algorithm for the same problem that proceeds by 

= r?(x* + A T r l ; 9 t ) , (5.9) 
r* = y-Ax 1 . (5.10) 

The only (but important) difference lies in the introduction of the term b^r' -1 in the second equation, 
cf. Eq. (|5.2p . This can be regarded as a momentum term with a very specific prescription on its size, 
cf. Eq. (|5.5p . A similar term -with motivations analogous to the one presented below- is popular 
under the name of 'Onsager term' in statistical physics [( )ns361 ITAF771 IMP V87] . 
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5.2 ... and its derivation 



In this section we present an heuristic derivation of the AMP iteration in Eqs. (|5.ip . (|5.2p starting 
from the standard message passing formulation given by Eq. (|4.12p to (|4.15p . Our objective is 
to develop an intuitive understanding of the AMP iteration, as well as of the prescription (15. 5p . 
Throughout our argument, we treat m as scaling linearly with n. A full justification of the derivation 
presented here is beyond the scope of this review: the actual rigorous analysis of the AMP algorithm 
goes through an indirect and very technical mathematical proof [BMllj . 

We start by noticing that the sums Y,j£da\i A aj"fj^a an d Y,b&di\a ^bi^b^i are sums of 6(n) terms, 
each of order 1/n (because A 2 ai = 0(l/n)). Notice that the terms in these sums are not independent: 
nevertheless by analogy to what happens in the case of sparse graphs [MT061 IMon08t IRU081 IAS03] . 
one can hope that dependencies are weak. It is then reasonable to think that a law of large numbers 
applies and that therefore these sums can be replaced by quantities that do not depend on the 
instance or on the row/column index. 

We then let r a _ ¥i = a*_>j//3*_>j and rewrite the message passing iteration, cf. Eqs. (14.121) to 
flggD , as 

r a-+i = Ua— ^ A a jX l ^ a , (5-11) 
je[n]\i 

<¥a = v( E Axrln-A), (5.12) 

feS[m]\a 

where 9t ~ A/ Ylb£di\a ^li^b^i * s ~ as mentioned- treated as independent of b. 

Notice that on the right-hand side of both equations above, the messages appear in sums over 
0(n) terms. Consider for instance the messages {r*^.j}j g r n ] for a fixed node a G [m]. These depend 
on i £ [n] only because the term excluded from the sum on the right hand side of Eq. (|5.1ip changes. 
It is therefore natural to guess that r t a _^ i = r* + <9(n -1 / 2 ) and x*_ s>a = x\ + 0(m~ 1 / 2 ), where r* only 
depends on the index a (and not on i), and x\ only depends on i (and not on a). 

A naive approximation would consist in neglecting the 0{n~ l l 2 ) correction but this approximation 
turns out to be inaccurate even in the large-n limit. We instead set 

■ = r A- /)r* ■ — t A- &t.. 

Substituting in Eqs. (|5TT|) and Kl2h . we get 

J'6[n] 

x* +1 + Sxl+\ = r ? (E A u {r\ + SrUi) ~ 4»(r< + Sr^); 6 t ) . 

be[m] 

We will now drop the terms that are negligible without writing explicitly the error terms. First of all 
notice that single terms of the type A i<Jr*_^ are of order 1/n and can be safely neglected. Indeed 
fi r a->i = 0{n~ l l 2 ) by our ansatz, and A w i = 0{n~ l l 2 ) by definition. We get 

r a + ^ r a-M = Ua ~ ^ ] A a j(Xj + SXj_^ a ) + A a iX^ , 
be[m] 
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We next expand the second equation to linear order in 5x\^ ta and Sr^f 

z a + ^ r a->i = Ua — A a j (Xj + dx*^^ + A a ix\ , 

ie[n] 



be[m] be[m] 

The careful reader might be puzzled by the fact that the soft thresholding function u i— > ^(u; #) is 
non-differentiable at u £ {+0, — 0}. However, the rigorous analysis carried out in ISM 1 1 through a 
different (and more technical) methods reveals that almost-everywhere differentiability is sufficient 
here. 

Notice that the last term on the right hand side of the first equation above is the only one 
dependent on i, and we can therefore identify this term with Sr^^. We obtain the decomposition 

r l = ya- ^2 Aa A x ) + Sxt j^a) , (5-13) 
5^ = A ai x\. (5.14) 
Analogously for the second equation we get 



= V 



^2 A bl (r t b + 5rU i y,e t ) , (5.15) 



6G[-m] 



5x] 



t+i 



i— >a 



feS[m] 

Substituting Eq. (I5.14p in Eq. (I5.15P to eliminate Srl^ we get 



-r/( E Auirl + SrU^G^aiTl. (5.16) 



6S[m] fe6[m] 



and using the normalization of A, we get ^feefm] — ^ 1> whence 

= 7,(3;* + A T r t \d t ) . (5.18) 
Analogously substituting Eq. (|5,16p in (I5.13p . we get 



ya-Y. A «i x \ + E A l^'( x T l + (^ t_1 )i; ^y- 1 . (5.i9) 

ie[n] je[n] 



Again, using the law of large numbers and the normalization of A, we get 

E + (A r r^) 1 -e t -i) - 1 E + (^V-V.fc-i) = -II^Ho , (5.20) 

je[n] ie[n] 

whence substituting in (|5.19p . we obtain Eq. (|5.2p . with the prescription (|5.5p for the Onsager term. 
This finishes our derivation. 



16 




-1 -0.5 0.5 1 1.5 2 2.5 3 -1 -0.5 0.5 1 1.5 2 2.5 3 



(x l + A T r t ) i {x l + A T r t ) i 

Figure 6: Distributions of un-thrcsholded estimates for AMP (left) and 1ST (right), after t = 10 iterations. 
These data were obtained using sensing matrices with m = 2000, n = 4000 and i.i.d. entries uniform in 
{ + l/v / m, — 1/^/m). The signal x contained 500 non-zero entries uniform in { + 1, —1}. A total of 40 instances 
was used to build the histograms. Blue lines are Gaussian fits and vertical lines represent the fitted mean. 



6 High-dimensional analysis 

The AMP algorithm enjoys several unique properties. In particular it admits an asymptotically exact 
analysis along sequences of instances of diverging size. This is quite remarkable, since all analysis 
available for other algorithms that solve the LASSO hold only 'up to undetermined constants'. 

In particular in the large system limit (and with the exception of a 'phase transition' line), AMP 
can be shown to converge exponentially fast to the LASSO optimum. Hence the analysis of AMP 
yields asymptotically exact predictions on the behavior of the LASSO, including in particular the 
asymptotic mean square error per variable. 

6.1 Some numerical experiments with AMP 

How is it possible that an asymptotically exact analysis of AMP can be carried out? Figure [6] 
illustrates the key point. It shows the distribution of un-thresholded estimates {x l + A T r t )i for 
coordinates i such that the original signal had value X{ = +1. These estimates were obtained 
using the AMP algorithm (|5.1[) . (|5.2p with choice (|5.5p of (plot on the left) and the iterative 
soft thresholding algorithm (|5.9|) . (|5.10p (plot on the right). The same instances (i.e. the same 
matrices A and measurement vectors y) were used in the two cases, but the resulting distributions 
are dramatically different. In the case of AMP, the distribution is close to Gaussian, with mean on 
the correct value, xi = +1. For iterative soft thresholding the estimates do not have the correct 
mean and are not Gaussian. 

This phenomenon appears here as an empirical observation, valid for a specific iteration number 
t, and specific dimensions m, n. In the next section we will explain that it can be proved rigorously in 
the limit of a large number of dimensions, for all values of iteration number t. Namely, as m, n — > oo 
at t fixed, the empirical distribution of {(x* + A T r t )i — Xj} i6 [ n ] converges to a gaussian distribution, 
when x l and r are computed using AMP. The variance of this distribution depends on t, and the its 
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Figure 7: Evolution of the mean square error as a function of the number of iterations for AMP 
(left) and iterative soft thresholding (right), for random measurement matrices A, with i.i.d. entries 
A a i £ {+l/y/m, —1/y/rn} uniformly. Notice the different scales used for the horizontal axis! Here 
n = 8000, m = 1600. Different curves depends to different levels of sparsity. The number of non-zero 
entries of the signal x is, for the various curves, ||x||o = 800, 1200, 1600, 1800 (from bottom to top). 



evolution with t can be computed exactly. Viceversa, for iterative soft thresholding, the distribution 
of the same quantities remains non-gaussian. 

This dramatic difference remains true for any t, even when AMP and 1ST converge the same 
minimum. Indeed even at the fixed point, the resulting residual r l is different in the two algorithms, 
as a consequence of the introduction Onsager term. 

More importantly, the two algorithms differ dramatically in the rate of convergence. One can 
interpret the vector [x l + A T r l ) — x as 'effective noise' after t iterations. Both AMP and 1ST 
'denoise' the vector (x* + ^4 T r*) using the soft thresholding operator. As discussed in Section [3l the 
soft thresholding operator is essentially optimal for denoising in gaussian noise. This suggests that 
AMP should have superior performances (in the sense of faster convergence to the LASSO minimum) 
with respect to simple 1ST. 

Figure [7] presents the results of a small experiment confirming this expectation. Measurement 
matrices A with dimensions m = 1600, n = 8000, were generated randomly with i.i.d. entries 
A a i £ {+l/y/m,—l/y/m} uniformly at random. We consider here the problem of reconstructing 
a signal x with entries Xi £ {+1,0,-1} from noiseless measurements y = Ax, for different levels 
of sparsity. Thresholds were set according to the prescription (|5.6p with a = 1.41 for AMP (the 
asymptotic theory of [DMM09 yields the prescription a ~ 1.40814) and a = 1.8 for 1ST (optimized 
empirically). For the latter algorithm, the matrix A was rescaled in order to get an operator norm 
\\A\\ 2 = 0.95. 

Convergence to the original signal x is slower and slower as this becomes less and less sparse^. 
Overall, AMP appears to be at least 10 times faster even on the sparsest vectors (lowest curves in 

2 Indeed basis pursuit (i.e. reconstruction via £i minimization) fails with high probability if ||x||o/m > 0.243574, see 
|Don06| and Section \6. 6 1 



18 



the figure). 



6.2 State evolution 

State evolution describes the asymptotic limit of the AMP estimates asm,n-> oo, for any fixed t. 
The word 'evolution' refers to the fact that one obtains an 'effective' evolution with t. The word 
'state' refers to the fact that the algorithm behavior is captured in this limit by a single parameter 
(a state) Tt 6 M. 

We will consider sequences of instances of increasing sizes, along which the AMP algorithm 
behavior admits a non-trivial limit. An instance is completely determined by the measurement matrix 
A, the signal x, and the noise vector w, the vector of measurements y being given by y = Ax + w, 
cf. Eq. While rigorous results have been proved so far only in the case in which the sensing 

matrices A have i.i.d. Gaussian entries, it is nevertheless useful to collect a few basic properties that 
the sequence needs to satisfy in order for state evolution to hold. 

Definition 1. The sequence of instances {x(n), w(n), A(n)} n£ ?$ indexed by n is said to be a converg- 
ing sequence if x(n) E M. n , w(n) G M. m , A(n) G flj mxn with m = m(n) is such that m/n — > 6 € (0, oo), 
and in addition the following conditions hold: 

(a) The empirical distribution of the entries of x(n) converges weakly to a probability measure po 
on M with bounded second moment. Further n^ 1 Y17=i x i{ n ) 2 ~* ^poi^oi- 

(b) The empirical distribution of the entries of w(n) converges weakly to a probability measure pw 
on R with bounded second moment. Further m~ l YliLi w i( n ) 2 ~~ ^ ^-pwiW 2 } = o~ 2 . 

(c) //{ej}i<j< n , e« £ R n denotes the canonical basis, then linin^oo maxj g r n ] ||^4(ri)ej || 2 = 1, 
lim^oo min i6[n ] ||A(n)ei|| 2 = 1. 

As mentioned above, rigorous results have been proved only for a subclass of converging sequences, 
namely under the assumption that the matrices A(n) have i.i.d. Gaussian entries. Notice that such 
matrices satisfy condition (c) by elementary tail bounds on x-square random variables. The same 
condition is satisfied by matrices with i.i.d. subgaussian entries thanks to concentration inequalities 
[LedOlj . 

On the other hand, numerical simulations show that the same limit behavior should apply within a 
much broader domain, including for instance random matrices with i.i.d. entries under an appropriate 
moment condition. This universality phenomenon is well-known in random matrix theory whereby 
asymptotic results initially established for Gaussian matrices were subsequently proved for broader 
classes of matrices. Rigorous evidence in this direction is presented in |KM10b| . This paper shows 
that the normalized cost min^gRn C^ n ^ y ^(x)/n has a limit for n — > 00, which is universal with 
respect to random matrices A with i.i.d. entries. (More precisely, it is universal provided E{j4jj} = 0, 
Ej^} = 1/m and E{^} < C/m 3 for some n-independent constant C .) 

For a converging sequence of instances {x(n), w(n), A(n)} n ^fq, and an arbitrary sequence of 
thresholds {6t}t>o (independent of n), the AMP iteration (15. ip . (|5.2p admits a high-dimensional 
limit which can be characterized exactly, provided Eq. (|5.5p is used for fixing the Onsager term. 
This limit is given in terms of the trajectory of a simple one-dimensional iteration termed state 
evolution which we will describe next. 
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Define the sequence {r 2 }t>o by setting Tq = a 2 + E{Xq}/5 (for X ~ p an d o 2 = E{W 2 }, 
W ~ pw) and letting, for all t > 0: 

^+1 = F(r t 2 ,^), (6.1) 
F(r 2 ,0) = a 2 + ^E{[ V (Xo + TZ;e)-X } 2 }, (6.2) 

where Z ~ N(0, 1) is independent of Xq ~ po- Notice that the function F depends implicitly on the 
law pq. Further, the state evolution {r 2 }t>o depends on the specific converging sequence through 
the law po, and the second moment of the noise E pvF {W 2 }, cf. Definition [TJ 

We say a function ifj : M k — > R is pseudo-Lipschitz if there exist a constant L > such that for 
all x,y € M fc : \tp(x) - ip{y)\ < L(l + ||x|| 2 + 1 1 2/ 1 1 2 ) 1 1 ^ - y||a- (This is a special case of the definition 
used in [BMllj where such a function is called pseudo-Lipschitz of order 2.) 

The following theorem was conjectured in [DMM09] . and proved in |BM11| . It shows that the 
behavior of AMP can be tracked by the above state evolution recursion. 

Theorem 6.1 ( |BM11| ). Let {x(n), w(n), ^4(n)} n6 N be a converging sequence of instances with the 
entries of A(n) i.i.d. normal with mean and variance 1/m, while the signals x(n) and noise vectors 
w(n) satisfy the hypotheses of Definition^ Let ip\ : R — > M, ip2 : R x R — > M. be pseudo-Lipschitz 
functions. Finally, let {x t }t>o, {^ 4 }i>o ^ e the sequence of estimates and residuals produced by AMP, 
cf. Eqs. \5.1\) . 15. 2\) . Then, almost surely 



hm-f>i(<) = EU^rtZ)}, (6.3) 

o=l 

1 11 

lim -V^ 2 (x* +1 ,x i ) = E\ip 2 ( V (X + T t Z;O t ),X )}, (6.4) 



where Z ~ N(0, 1) is independent of Xq ~ po- 

It is worth pausing for a few remarks. 

Remark 1. Theorem 16.11 holds for any choice of the sequence of thresholds {6t\t>Q- It does not 
require -for instance- that the latter converge. Indeed [BMllj proves a more general result that 
holds for a a broad class of approximate message passing algorithms. The more general theorem 
establishes the validity of state evolution in this broad context. 

For instance, the soft thresholding functions r/( • ; 9t) can be replaced by a generic sequence of 
Lipschitz continuous functions, provided the coefficients bt in Eq. ()5.2[) are suitably modified. 



Remark 2. This theorem does not require the vectors x(n) to be sparse. The use of other functions 
instead of the soft thresholding functions rj(- ;6t) in the algorithm can be useful for estimating such 
non-sparse vectors. 

Alternative nonlinear ities, can also be useful when additional information on the entries of x(n) 
is available. 

Remark 3. While the theorem requires the matrices A(n) to be random, neither the signal x(n) 
nor the noise vectors w(n) need to be random. They are generic deterministic sequences of vectors 
under the conditions of Definition [TJ 
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Figure 8: Mapping r 2 H- F(T 2 ,ar) for a = 2, 5 = 0.64, a 2 = 0.2, p {{+l}) = Po({-l}) = 0.064 and 
Po ({0}) = 0-872. 



The fundamental reason for this universality is that the matrix A is both row and column ex- 
changeable. Row exchangeability guarantees universality with respect to the signals x(n), while 
column exchangeability guarantees universality with respect to the noise win). To see why, observe 
that, by row exchangeability (for instance), x(n) can be replaced by the random vector obtained 
by randomly permuting its entries. Now, the distribution of such a random vector is very close (in 
appropriate sense) to the one of a random vector with i.i.d. entries whose distribution matches the 
empirical distribution of x(n). 

Theorem 16.11 strongly supports both the use of soft thresholding, and the choice of the threshold 
level in Eq. (|5.6[) or (|5.7p . Indeed Eq. (|6.3p states that the components of r* are approximately i.i.d. 
N(0, r 2 ), and hence both definitions of % in Eq. (|5.6p or (|5.T|) provide consistent estimators of Tt- 
Further, Eq. (|6.3p implies that the components of the deviation (x t +A T r t — x) are also approximately 
i.i.d. N(0,r 2 ). In other words, the estimate [x l + A T r l ) is equal to the actual signal plus noise of 
variance r 2 , as illustrated in Fig. [6l According to our discussion of scalar estimation in Section [3l 
the correct way of reducing the noise is to apply soft thresholding with threshold level art- 

The choice 6t = art with a fixed has another important advantage. In this case, the sequence 
{Tt}t>o is determined by the one-dimensional recursion 

r 2 +1 = F(r 2 ,ari). (6.5) 

The function r 2 i— >■ F(r 2 , err) depends on the distribution of Xq as well as on the other parameters of 
the problem. An example is plotted in Fig. 0. It turns out that the behavior shown here is generic: 
the function is always non-decreasing and concave. This remark allows to easily prove the following. 
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Proposition 6.2 ( [DM MlOb] ) . Let a m \ n = a m i n (5) be the unique non-negative solution of the equa- 
tion 

(l + a 2 )$(-a) -a(j){a) = -, (6.6) 

with 4>{z) = e~ z ' ! / 2 / '\p2sK the standard Gaussian density and $(z) = <p(x) dx. 

For any a 2 > 0, a > o m i n (<5) ; the fixed point equation r 2 = F(r 2 ,ar) admits a unique solution. 
Denoting by r* = r* (a) this solution, we have lim^oo rt = r* (a) . 

It can also be shown that, under the choice 6t = ctr t , convergence is exponentially fast unless 
the problem parameters take some 'exceptional' values (namely on the phase transition boundary 
discussed below). 



6.3 The risk of the LASSO 

State evolution provides a scaling limit of the AMP dynamics in the high-dimensional setting. By 
showing that AMP converges to the LASSO estimator, one can transfer this information to a scaling 
limit result of the LASSO estimator itself. 

Before stating the limit, we have to describe a calibration mapping between the AMP parameter 
a (that defines the sequence of thresholds {9t}t>o) and the LASSO regularization parameter A. The 
connection was first introduced in [DMMlOb] . 

We define the function a \-t X(a) on (a m i n (<5), oo), by 



A (a) = or* 



1--F{\X + t*Z\ >ar t } 



(6.7) 



where r* = r*(a) is the state evolution fixed point defined as per Proposition 16.21 Notice that 
this relation corresponds to the scaling limit of the general relation (|5.3p . provided we assume that 
the solution of the LASSO optimization problem (|2.8p is indeed described by the fixed point of 
state evolution (equivalently, by its t — > oo limit). This follows by noting that 0t —> ctr* and that 
||x||o/n — > E{r/(Xo +t*Z; err*)}. While this is just an interpretation of the definition (]6.7|) . the result 
presented next implies that the interpretation is indeed correct. 

In the following we will need to invert the function a h-» A(q). We thus define a : (0, oo) — > 
(amim co) in such a way that 



(A) G { a G (a min , oo) : A(a) = A} . 



The fact that the right-hand side is non-empty, and therefore the function A i— > a(X) is well defined, 
is part of the main result of this section. 

Theorem 6.3. Let {x(n), w(n), A(n)} ne N ° e a converging sequence of instances with the entries 
of A(n) i.i.d. normal with mean and variance 1/m. Denote by x{\) the LASSO estimator for 
instance (x(n),w(n), A(n)), with a 2 , A > 0, and let tp : R x R — > R be a pseudo-Lipschitz function. 
Then, almost surely 

1 n 

jKm-ZM^' 2 *) =®\i>{v(Xo + T*Z;9*),X Q )} , (6.8) 
=i 



n— loo n 



where Z ~ N(0, 1) is independent of Xq ~ po, r* = r*(a(A)) and 6* = a(A)T*(a(A)). 
Further, the function A i— > a(X) is well defined and unique on (0,oo). 
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The assumption of a converging problem sequence is important for the result to hold, while 
the hypothesis of Gaussian measurement matrices A{n) is necessary for the proof technique to be 
applicable. On the other hand, the restrictions A, a 2 > 0, and P{Xo 7^ 0} > (whence r* 7^ using 
Eq. (|6.7p ) are made in order to avoid technical complications due to degenerate cases. Such cases 
can be resolved by continuity arguments. 

Let us emphasize that some of the remarks made in the case of state evolution, cf. Theorem 16. 11 
hold for the last theorem as well. More precisely. 

Remark 4. Theorem 16.31 does not require either the signal x(n) or the noise vectors w(n) to be 
random. They are generic deterministic sequences of vectors under the conditions of Definition [TJ 

In particular, it does not require the vectors x(n) to be sparse. Lack of sparsity will reflect in a 
large risk as computed through the mean square error computed through Eq. (|6.8p . 

On the other hand, when restricting x(n) to be k sparse for k = ne (i.e. to be in the class J- n ^k)i 
one can derive asymptotically exact estimates for the minimax risk over this class. This will be 
further discussed in Section 16.61 

Remark 5. As a special case, for noiseless measurements a = 0, and as A — > 0, the above formulae 
describe the asymptotic risk (e.g. mean square error) for the basis pursuit estimator, minimize ||x|| 
subject to y = Ax. For sparse signals x(n) € J- n ^, k = np5, the risk vanishes below a certain phase 
transition line p < p c (5)- this point is further discussed in Section HTBl 

Let us now discuss some limitations of this result. Theorem 16 . 3 1 assumes that the entries of matrix 
A are i.i.d. Gaussians. Further, our result is asymptotic, and one might wonder how accurate it is 
for instances of moderate dimensions. 

Numerical simulations were carried out in [DMMldb , BBMlT] and suggest that the result is 
universal over a broader class of matrices and that is relevant already for n of the order of a few 
hundreds. As an illustration, we present in Figs. [9] and [10] the outcome of such simulations for two 
types of random matrices. Simulations with real data can be found in |BBMllj . We generated the 
signal vector randomly with entries in {+1,0,-1} and P(xo,« = +1) = P(xo,i = —1) = 0.064. The 
noise vector w was generated by using i.i.d. N (0,0.2) entries. 

We solved the LASSO problem (12. 8h and computed estimator x using CVX, a package for specifying 
and solving convex programs |GB10j and 0WLQN, a package for solving large-scale versions of LASSO 
[A.T07] . We used several values of A between and 2 and n equal to 200, 500, 1000, and 2000. The 
aspect ratio of matrices was fixed in all cases to 8 = 0.64. For each case, the point (A, MSE) was 
plotted and the results are shown in the figures. Continuous lines corresponds to the asymptotic 
prediction by Theorem 16.31 for ip(a,b) = (a — b) 2 , namely 

l im -\\x- x \\ 2 = ¥,{lT 1 {X + nZ;e*)-X () } 2 } = 5(t 2 - a 2 ) . 

The agreement is remarkably good already for n, m of the order of a few hundreds, and deviations 
are consistent with statistical fluctuations. 

The two figures correspond to different entries distributions: (z) Random Gaussian matrices with 
aspect ratio 5 and i.i.d. N(0, 1/m) entries (as in Theorem \6.3\i ; (ii) Random ±1 matrices with 
aspect ratio 5. Each entry is independently equal to +l/y / m or —1/y/m with equal probability. 
The resulting MSE curves are hardly distinguishable. Further evidence towards universality will be 
discussed in Section 16.71 

Notice that the asymptotic prediction has a minimum as a function of A. The location of this 
minimum can be used to select the regularization parameter. 
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Figure 9: Mean square error (MSE) as a function of the regularization parameter A compared to 
the asymptotic prediction for 5 = 0.64 and a 2 = 0.2. Here the measurement matrix A has i.i.d. 
N(0, 1/m) entries. Each point in this plot is generated by finding the LASSO predictor x using a 
measurement vector y = Ax + w for an independent signal vector x, an independent noise vector w, 
and an independent matrix A. 
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Figure 10: As in Fig. [9l but the measurement matrix A has i.i.d. entries that are equal to ±l/y/rn 
with equal probabilities. 
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6.4 A decoupling principle 



There exists a suggestive interpretation of the state evolution result in Theorem 16.11 as well as of 
the scaling limit of the LASSO established in Theorem 16.31 The estimation problem in the vector 
model y = Ax + w reduces -asymptotically- to n uncoupled scalar estimation problems yi = x% + w%. 
However the noise variance is increased from a 2 to t 2 (or r 2 in the case of the LASSO), due to 
'interference' between the original coordinates: 



y = Ax + w <^ < 



yi = xx + w\ 

m = x 2 + W2 , s 

(6.9) 



An analogous phenomenon is well known in statistical physics and probability theory and takes 
sometimes the name of 'correlation decay' [Wei05l IGK07j IMM09] . In the context of CDMA system 
analysis via replica method, the same phenomenon was also called 'decoupling principle' |Tan02l 
IGV05] . 

Notice that the AMP algorithm gives a precise realization of this decoupling principle, since for 
each i G [n], and for each number of iterations t, it produces an estimate, namely (x l + A T r t )i that 
can be considered a realization of the observation yi above. Indeed Theorem 16.11 (see also discussion 
below the theorem) states that (x* + A T r t )i = xi + Wi with wi asymptotically Gaussian with mean 
and variance r 2 . 

The fact that observations of distinct coordinates are asymptotically decoupled is stated precisely 
below. 

Corollary 6.4 (Decoupling principle, |BM11| ). Under the assumption of Theorem \ 6.1\ fix I > 2, 

let ip : R 2e -> R be any Lipschitz function, and denote by E expectation with respect to a uniformly 
random subset of distinct indices J(l), . . . , J(£) 6 [n]. 

Further, for some fixed t > 0, lety t = x t + A T r l € W 1 . Then, almost surely 



J^ E ^(yj(i)>- ■ -iyj(e)> x J(i)>- ■ -i x J{i)) = e{v>(Ao,i + r t Zi,. . . , x 0/ + r t Zg, x 0>1 ,. . . ,x ,i)}, 

for XQ i ~ po and Zi ~ N(0, 1), i = 1, . . . ,£ mutually independent. 



6.5 An heuristic derivation of state evolution 

The state evolution recursion has a simple heuristic description that is useful to present here since 
it clarifies the difficulties involved in the proof. In particular, this description brings up the key role 
played by the 'Onsager term' appearing in Eq. (|5.2p [DMM09J. 

Consider again the recursion (15. lj) . (|5.2j) but introduce the following three modifications: (i) 
Replace the random matrix A with a new independent copy A{t) at each iteration t; (ii) Corre- 
spondingly replace the observation vector y with y t = A(t)x + w; (Hi) Eliminate the last term in the 
update equation for r . We thus get the following dynamics: 

x t+1 = r t (A(t) T r t + x t ;9 t ), (6.10) 
r l = y l - A(t)x l , (6.11) 
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where A(0) , A(l) , A(2) , . . . are i.i.d. matrices of dimensions m x n with i.i.d. entries Ay(t) ~ 
N(0, 1/m). (Notice that, unlike in the rest of the article, we use here the argument of A to denote 
the iteration number, and not the matrix dimensions.) 

This recursion is most conveniently written by eliminating r*: 

x t+l = riiAitfy* + (I - AitfAit))^; t ) , 

= V (x + A(t) T w + B{t){x l - x);6 t ) , (6.12) 

where we defined B(t) = I-A(t) T A(t) G M nxn . Let us stress that this recursion does not correspond 
to any concrete algorithm, since the matrix A changes from iteration to iteration. It is nevertheless 
useful for developing intuition. 

Using the central limit theorem, it is easy to show that each entry of B(t) is approximately 
normal, with zero mean and variance 1/m. Further, distinct entries are approximately pairwise 
independent. Therefore, if we let r f 2 = lim n _ 5 . 00 — xW^/n, we obtain that B(t)(x t — x) converges 
to a vector with i.i.d. normal entries with mean and variance nrf/m = rf /5. Notice that this is 
true because A(t) is independent of {A(s)}i< s <t-i and, in particular, of {x l — x). 

Conditional on w, A(t) T w is a vector of i.i.d. normal entries with mean and variance (l/m)||tu||| 
which converges by assumption to a 2 . A slightly longer exercise shows that these entries are approx- 
imately independent from the ones of B(t)(x t — Xo). Summarizing, each entry of the vector in the 
argument of r\ in Eq. (|6.12p converges to Xo + nZ with Z ~ N(0, 1) independent of Xq, and 

(6.13) 

On the other hand, by Eq. (|6.12j) . each entry of x t+1 — x converges to rj(Xo + r$ Z; 0i) — Xo, and 
therefore 

rf +1 = lim -\\x t+1 -x\\ 2 2 =K{[ V (X + T t Z;e t )-Xo] 2 }. (6.14) 

Using together Eq. (I6.13P and (16.14j) we finally obtain the state evolution recursion, Eq. (|6.1D . 

We conclude that state evolution would hold if the matrix A was drawn independently from 
the same Gaussian distribution at each iteration. In the case of interest, A does not change across 
iterations, and the above argument falls apart because x t and A are dependent. This dependency is 
non- negligible even in the large system limit n — > oo. This point can be clarified by considering the 
1ST algorithm given by Eqs. (15.91) . fl5. lOf) . Numerical studies of iterative soft thresholding |MD10t 
IDMM09] show that its behavior is dramatically different from the one of AMP and in particular 
state evolution does not hold for 1ST, even in the large system limit. 

This is not a surprise: the correlations between A and x l simply cannot be neglected. On the 
other hand, adding the Onsager term leads to an asymptotic cancelation of these correlations. As a 
consequence, state evolution holds for the AMP iteration. 

6.6 The noise sensitivity phase transition 

The formalism developed so far allows to extend the minimax analysis carried out in the scalar case 
in Section [3] to the vector estimation problem [DMMlOb] . We define the LASSO mean square error 



^2 J2 , 1-2 

T t = & + j r i , 

t? = lim —\\x t — 

n— ¥oo n 
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Figure 11: Noise sensitivity phase transition in the plane (8, p) (here 6 = m/n is the undersampling ratio 
and p = ||x||o/w is the number of non-zero coefficients per measurement). Red line: The phase transition 
boundary p = p c (S). Blue lines: Level curves for the LASSO minimax M*(S,p). Notice that M*(6, p) t oo as 
pt Pee- 



per coordinate when the empirical distribution of the signal converges to po, as 

MSE(cj 2 ;po,A) = lim - E{p(A) - x\\l] , (6.15) 

n— >oo n 

where the limit is taken along a converging sequence. This quantity can be computed using Theorem 
16.31 for any specific distribution p^. 

We consider again the sparsity class J- e with e = p5. Hence p = ||x||o/r?i measures the number 
of non-zero coordinates per measurement. Taking the worst case MSE over this class, and then the 
minimum over the regularization parameter A, we get a result that depends on p, 5, as well as on 
the noise level a 2 . The dependence on a 2 must be linear because the class is scale invariant, and 
we obtain therefore 

inf sup MSE(a 2 ; Po ,X) = M*(5,p)a 2 , (6.16) 

for some function (5, p) \— > M*(5,p). We call this the LASSO minimax risk. It can be interpreted as 
the sensitivity (in terms of mean square error) of the LASSO estimator to noise in the measurements. 

It is clear that the prediction for MSE(a 2 ;po, A) provided by Theorem 16.31 can be used to charac- 
terize the LASSO minimax risk. What is remarkable is that the resulting formula is so simple. 

Theorem 6.5 ( [DMM10b| ). Assume the hypotheses of Theorem \6.3\ and recall that M*(e) denotes 
the soft thresholding minimax risk over the class JF e cf. Eqs. A3.8\) . h3.10\) . Further let p c {§) be the 
unique solution of p = M^(pS). 
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Then for any p < p c (5) the LASSO minimax risk is bounded and given by 



Viceversa, for any p > p c (5), we have M*(5,p) = oo. 

Figure [TT] shows the location of the noise sensitivity boundary p c (5) as well as the level lines of 
M*(5,p) for p < p c (5). Above p c (5) the LASSO MSE is not uniformly bounded in terms of the 
measurement noise a 2 . Other estimators (for instance one step of soft thresholding) can offer better 
stability guarantees in this region. 

One remarkable fact is that the phase boundary p = p c (5) coincides with the phase transition 
for £q — t\ equivalence derived earlier by Donoho |Don06] on the basis of random polytope geometry 
results by Affentranger-Schneider [AS92J . The same phase transition was further studied in a series of 
papers by Donoho, Tanner and coworkers [DT051 |DT09| . in connection with the noiseless estimation 
problem. For p < p c estimating x by ^i-norm minimization returns the correct signal with high 
probability (over the choice of the random matrix A). For p > p c (8), ^i-minimization fails. 

Here this phase transition is derived from a completely different perspective as a special case of 
a stronger result. We indeed use a new method -the state evolution analysis of the AMP algorithm- 
which offers quantitative information about the noisy case as well, namely it allows to compute the 
value of M*(5,p) for p < p c (5). Within the present approach, the line p c (5) admits a very simple 
expresson. In parametric form, it is given by 

20(a) 

a + 2(<f>(a) -a*(-o)) ' 1 ' ' 

P = 1-^, ( 6 - 19 ) 
4>(a) 

where <f> and $ are the Gaussian density and Gaussian distribution function, and a £ [0, oo) is the 
parameter. Indeed a has a simple and practically important interpretation as well. Recall that the 
AMP algorithm uses a sequence of thresholds 6% = cert, cf. Eqs. (|5.6p and (|5.7p . How should the 
parameter a be fixed? A very simple prescription is obtained in the noiseless case. In order to 
achieve exact reconstruction for all p < p c (S) for a given an undersampling ratio S, a should be such 
that (5,p c (S)) = (5(a), p(a)) with functions a t- > 5(a), a i-> p(a) defined as in Eq. (|6.18p . (|6.19p . In 
other words, this parametric expression yields each point of the phase boundary as a function of the 
threshold parameter used to achieve it via AMP. 



6.7 On universality 

The main results presented in this section, namely Theorems 16.11 IfPl and HHjI are proved for measure- 
ment matrices with i.i.d. Gaussian entries. As stressed above, it is expected that the same results 
hold for a much broader class of matrices. In particular, they should extend to matrices with i.i.d. 
or weakly correlated entries. For the sake of clarity, it is useful to put forward a formal conjecture, 
that generalizes Theorem 16.31 

Conjecture 6.6. Let {x(n),w(n), A(n)} n ^ be a converging sequence of instances with the entries 
of A(n) i.i.d. with mean Ej^ljj} = 0, variance E{A? } = 1/m and such that E{^4? } < C/m for 
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Figure 12: Mean square error for as a function of the regularization parameter A for a partial Fourier 
matrix (see text). The noise variance is a 2 = 0.2, the undersampling factor 5 = 0.2 and the sparsity 
ratio p = 0.2. Data points are obtained by averaging over 20 realizations, and error bars are 95% 
confidence intervals. The continuous line is the prediction of Theorem 16.31 
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Figure 13: As in Fig. (T2J but for a measurement matrix A which models the analog-to-digital 



converter of TLD + 10 . 
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some fixed constant C. Denote by x(X) the LASSO estimator for instance (x(n),w(n),A(n)), with 
<t 2 , A > 0, and let ip : K. x IR — > IR be a pseudo-Lipschitz function. Then, almost surely 




)} 



(6.20) 



where Z ~ N(0, 1) is independent of Xq ~ pq, r* = T*(a(\)) and 9* = a(\)T*(a(\)) are given by the 
same formulae holding for Gaussian matrices, cf. Section \6.3l 

The conditions formulated in this conjecture are motivated by the universality result in |KM10b) . 
that provides partial evidence towards this claim. Simulations (see for instance Fig. [TU]and jBBMllj ) 
strongly support this claim. 

While proving Conjecture 16.61 is an outstanding mathematical challenge, many measurement 
models of interest do not fit the i.i.d. model. Does the theory developed in these section say 
anything about such measurements? Systematic numerical simulations [DMMlO b, BBMll] reveal 
that, even for highly structured matrices, the same formula 16.201 is either surprisingly close to the 
actual empirical performances. 

As an example, Fig. [12] presents the empirical mean square error for a partial Fourier measurement 
matrix A, as a function of the regularization parameter A. The matrix is obtained by subsampling 
the rows of the N x N Fourier matrix F, with entries Fij = e 2 ^^^. More precisely we sample 
re/2 rows of F with replacement, construct two rows of A by taking real and imaginary part, and 
normalize the columns of the resulting matrix. 

Figure [13] presents analogous results for the random demodulator matrix which is at the core of the 
analog-to-digital converter (ADC) of [TLD + 10] , Schematically, this is obtained by normalizing the 
columns of A = HDF, with F a Fourier matrix, D a random diagonal matrix with Da 6 { + 1, — 1} 
uniformly at random, and H an 'accumulator': 



Both these examples show good agreement between the asymptotic prediction provided by The- 
orem 16.31 and the empirical mean square error. Such an agreement is surprising given that in both 
cases the measurement matrix is generated with a small amount of randomness, compared to a 
Gaussian matrix. For instance, the ADC matrix only requires re random bits. Although statistically 
significant discrepancies can be observed (cf. for instance Fig. [T3l) . the present approach provides 
quantitative predictions of great interest for design purposes. For a more systematic investigation, 
we refer to [BBMll] . 

6.8 Comparison with other analysis approaches 

The analysis presented here is significantly different from more standard approaches. We derived an 
exact characterization for the high-dimensional limit of the LASSO estimation problem under the 
assumption of converging sequences of random sensing matrices. 

Alternative approaches assume an appropriate 'isometry', or 'incoherence' condition to hold for 
A. Under this condition upper bounds are proved for the mean square error. For instance Candes, 
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Romberg and Tao |CRT06| prove that the mean square error is bounded by Co 2 for some constant 
C . Work by Candes and Tao [CT07] on the analogous Dantzig selector, upper bounds the mean 
square error by Ca 2 (k/n) logra, with k the number of non-zero entries of the signal x. 

These type of results are very robust but present two limitations: (i) They do not allow to 
distinguish reconstruction methods that differ by a constant factor (e.g. two different values of A); 
(ii) The restricted isometry condition (or analogous ones) is quite restrictive. For instance, it holds 
for random matrices only under very strong sparsity assumptions. These restrictions are intrinsic to 
the worst-case point of view developed in jCRT061 ICT07| . 

Guarantees have been proved for correct support recovery in |ZY06| . under an incoherence as- 
sumption on A. While support recovery is an interesting conceptualization for some applications 
(e.g. model selection), the metric considered in the present paper (mean square error) provides 
complementary information and is quite standard in many different fields. 

Close to the spirit of the treatment presented here, |RFG09| derived expressions for the mean 
square error under the same model considered here. Similar results were presented recently in 
|KWT09l GBS09;. These papers argue that a sharp asymptotic characterization of the LASSO 
risk can provide valuable guidance in practical applications. Unfortunately, these results were non- 
rigorous and were obtained through the famously powerful 'replica method' from statistical physics 
|MM09| . The approach discussed here offers two advantages over these recent developments: (i) It is 
completely rigorous, thus putting on a firmer basis this line of research; (ii) It is algorithmic in that 
the LASSO mean square error is shown to be equivalent to the one achieved by a low-complexity 
message passing algorithm. 

Finally, recently random models for the measurement matrix have been studied in |CP09t ICPlOb] . 
The approach developed in these papers allows to treat matrices that do not necessarily satisfy the 
restricted isometry property or similar conditions, and applies to a general class of random matrices 
A with i.i.d. rows. On the other hand, the resulting bounds are not asymptotically sharp. 

7 Generalizations 

The single most important advantage of the point of view based on graphical models is that it offers 
a unified disciplined approach to exploit structural information on the signal x. The use of such 
information can dramatically reduce the number of required compressed sensing measurements. 

'Model-based' compressed sensing jBCDHIO] provides a general framework for specifying such 
information. However, it focuses on 'hard' combinatorial information about the signal. Graphical 
models are instead a rich language for specifying 'soft' dependencies or constraints, and more complex 
models. These might include combinatorial constraints, but vastly generalize them. Also, graphical 
models come with an algorithmic arsenal that can be applied to leverage the potential of such more 
complex signal models. 

Exploring such potential generalizations is -to a large extent- a future research program which 
is still in its infancy. Here we will only discuss a few examples. 

7.1 Structured priors. . . 

Block-sparsity is a simple example of combinatorial signal structure. We decompose the signal as 
x = (x.b(i)) x B(2)i •••) x B(l)) where x B n\ G W 1 ^ is a block for I G {!,...,£}. Only a fraction 
£ € (0, 1) of the blocks is non- vanishing. This type of model naturally arises in many applications: 



31 




Figure 14: Two possible graphical representation of the block-sparse compressed sensing model (and cor- 
responding cost function (|2.5p ). Upper squares correspond to measurements y a , a € [m], and lower squares 
to the block sparsity constraint (in this case blocks have size 2). On the left, circles correspond to variables 
Xi 6 1, i 6 [n]. On they right, double circles correspond to blocks xb(i) £ K ra ^, i £ [£]. 



for instance the case £ = n/2 (blocks of size 2) can model signals with complex- valued entries. Larger 
blocks can correspond to shared sparsity patterns among many vectors, or to clustered sparsity. 
It is customary in this setting to replace the LASSO cost function with the following 

1 1 
Cf° c \z) = jllv-Aslli + A^lkflwIla- (7.1) 

i=l 

The block-^2 regularization promotes block sparsity. Of course, the new regularization can be inter- 
preted in terms of a new assumed prior that factorizes over blocks. 

Figure Q31 reproduces two possible graphical structures that encode the block-sparsity constraint. 
In the first case, this is modeled explicitly as a constraint over blocks of variable nodes, each block 
comprising n/l variables. In the second case, blocks correspond explicitly to variables taking values 
in R n /^. Each of these graphs dictates a somewhat different message passing algorithm. 

An approximate message passing algorithm suitable for this case is developed in [DM10] , Its 
analysis allows to generalize £q — l\ phase transition curves reviewed in Section 16.61 to the block 
sparse case. This quantifies precisely the benefit of minimizing f)7. 1 1) over simple £\ penalization. 

As mentioned above, for a large class of signals sparsity is not uniform: some subsets of entries 
are sparser than others. Tanaka and Raymond |TR10j . and Som, Potter and Schniter and [SSSlOj 
studied the case of signals with multiple level of sparsity. The simplest example consists of a signal 
x = { x B(i)-> x B(2))-> where xb(\) £ K ni , xb(2) £ ^™ 2 ) n i + n 2 = n. Block i £ {1,2} has a fraction of 
non-zero entries, with e\ 7^ £2- In the most complex case, one can consider a general factorized prior 

n 

p(dx) = Y[pi(dXi) , 

i=i 

where each i £ [n] has a different sparsity parameter £j £ (0,1), and pi £ JF Ei . In this case it is 
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Figure 15: Graphical model for compressed sensing of signals with clustered support. The support structure 
is described by an Hidden Markov Model comprising the lower factor nodes (filled squares) and variable nodes 
(empty circles). Upper variable nodes correspond to the signal entries Xi, i € [n], and upper factor nodes to 
the measurements y a , a € [m] . 



natural to use a weighted-^ i regularization, i.e. to minimize 

1 n 
C7^ t {z)^ l -\\y-Az\\l + \Y,^ l \, (7-2) 

i=i 

for a suitable choice of the weights w\, . . . , w n > 0. The paper |TR10j studies the case A — > (equiv- 
alent to minimizing ^ i Wi\Zi\ subject to y = Az), using non-rigorous statistical mechanics techniques 
that are equivalent to the state evolution approach presented here. Within a high-dimensional limit, 
it determines optimal tuning of the parameters u>i, for given sparsities £i. The paper |SSS10j fol- 
lows instead the state evolution approach explained in the present chapter. The authors develop a 
suitable AMP iteration and compute the optimal thresholds to be used by the algorithm. These are 
in correspondence with the optimal weights Wi mentioned above, and can be also interpreted within 
the minimax framework developed in the previous pages. 

The graphical model framework is particularly convenient for exploiting prior information that 
is probabilistic in nature, see in particular [CHDB081 ICICBIO] . A prototypical example was studied 
by Schniter [SchlOj who considered the case in which the signal x is generated by an Hidden Markov 
Model (HMM). As for the block-sparse model, this can be used to model signals in which the non-zero 
coefficients are clustered, although in this case one can accomodate greater stochastic variability of 
the cluster sizes. 

In the simple case studied in detail in [SchlOj , the underlying Markov chain has two states indexed 
by Si S {0, 1}, and 

n n—l 

p(dx)= ^2 {Y[p( dx i\ s i) ■ Y[p( s i+i\ s i) •Pi(si)} i (7-3) 

si,...,s„ i=l i=l 
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Figure 16: Graphical model for compressed sensing of signals with tree-structured prior. The support struc- 
ture is a tree graphical model, comprising factor nodes and variable nodes in the lower part of the graph. 
Upper variable nodes correspond to the signal entries Xi, i G [n], and upper factor nodes to the measurements 
y a , a G [m]. 



where p( • |0) and p(-|l) belong to two different sparsity classes J- eo , J- E1 . For instance one can 
consider the case in which £q = and e\ = 1, i.e. the support of x coincides with the subset of 
coordinates such that Sj = 1. 

Figure [15] reproduces the graphical structure associated with this type of models. This can be 
partitioned in two components: a bipartite graph corresponding to the compressed sensing mea- 
surements (upper part in Fig. [T5l) and a chain graph corresponding to the Hidden Markov Model 
structure of the prior (lower part in Fig. [T5]) . 

Reconstruction was performed in [SchlOj using a suitable generalization of AMP. Roughly speak- 
ing, inference is performed in the upper half of the graph using AMP and in the lower part using 
the standard forward-backward algorithm. Information is exchanged across the two component in a 
way that is very similar to what happens in turbo codes |RU08] . 

The example of HMM priors clarifies the usefulness of the graphical model structure in eliciting 
tractable substructures in the probabilistic model and hence leading to natural iterative algorithms. 
For an HMM prior, inference can be performed efficiently because the underlying graph is a simple 
chain. 

A broader class of priors for which inference is tractable is provided by Markov-tree distributions 
[SPSlOj . These are graphical models that factors according to a tree graph (i.e. a graph without 
loops) . A cartoon of the resulting compressed sensing model is reproduced in Figure [161 

The case of tree-structured priors is particularly relevant in imaging applications. Wavelet coef- 
ficients of natural images are sparse (an important motivating remark for compressed sensing) and 
non-zero entries tend to be localized along edges in the image. As a consequence, they cluster in 
subtrees of the tree of wavelet coefficients. A Markov-tree prior can capture well this structure. 
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Figure 17: Sparse sensing graph arising in a networking application. Each network flow (empty circles below) 
hashes into k = 2 counters (filled squares). 



Again, reconstruction is performed exactly on the tree-structured prior (this can be done effi- 
ciently using belief propagation), while AMP is used to do inference over the compressed sensing 
measurements (the upper part of Figure [T6|) . 

7.2 Sparse sensing matrices 

Throughout this review we focused for simplicity on dense measurement matrices A. Several of 
the mathematical results presented in the previous sections do indeed hold for dense matrices with 
i.i.d. components. Graphical models ideas are on the other hand particularly useful for sparse 
measurements. 

Sparse sensing matrices present several advantages, most remarkably lower measurement and 
reconstruction complexities [BGI + 08] , While sparse constructions are not suitable for all applica- 
tions, they appear a promising solution for networking applications, most notably in network traffic 
monitoring [CM041 |LMP+Q8bj . 

In an over-simplified example, one would like to monitor the sizes of n packet flows at a router. 
It is a recurring empirical observation that most of the flows consist of a few packets, while most of 
the traffic is accounted for by a few flows. Denoting by x±, X2, ■ ■ ■ x n the flow sizes (as measured, for 
instance, by the number of packets belonging to the flow), it is desirable to maintain a small sketch 
of the vector x = (x\, . . . , x n ). 

Figure [T7] describes a simple approach: flow i hashes into a small number -say k— of memory 
spaces, di = {ai(i), . . . , a^(i)} C [m]. Each time a new packet arrives for flow i, the counters in di 
are incremented. If we let y = (y±, . . . , y m ) be the contents of the counters, we have 

y = Ax, (7.4) 

where x > and A is a matrix with i.i.d. columns with k entries per column equal to 1 and all the 
other entries equal to 0. While this simple scheme requires unpractically deep counters (the entries 
of y can be large), |LMP+08b] showed how to overcome this problem by using a multi-layer graph. 

Numerous algorithms were developed for compressed sensing reconstruction with sparse mea- 
surement matrices [CM041 IXH071 lBGI+081 Hnd08j . Most of these algorithms are based on greedy 
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Figure 18: Factor graph describing the cost function (|7.8[) for the matrix completion problem. Variables 
Xi, yj <E R r are to be optimized over. The cost is a sum of pairwise terms (filled squares) corresponding to the 
observed entries in M. 



methods, which are essentially of message passing type. Graphical models ideas can be used to 
construct such algorithms in a very natural way. For instance, the algorithm of |LMP + 08b] (see also 
[LMP071 ILMP08a"l ICSWIOj for further analysis of the same algorithm) is closely related to the ideas 
presented in the rest of this chapter. It uses messages x\_^ a (from variable nodes to function nodes) 
and (from function nodes to variable nodes). These are updated according to 

J min {r^j : b G di \ a} at even iterations t, . . 

| maxjr^j : b G di \ a} at odd iterations t, 

where da denotes the set of neighbors of node a in the factor graph. These updates are very similar 
to Eqs. (|5.1ip . (|5.12p introduced earlier in our derivation of AMP. 



7.3 Matrix completion 

'Matrix completion' is the task of inferring an (approximately) low rank matrix from observations on 
a small subset of its entries. This problem has attracted considerable interest offer the last two years 
due to its relevance in a number of applied domains (collaborative filtering, positioning, computer 
vision, etc.). 

Significant progress has been achieved on the theoretical side. The reconstruction question has 
been addressed in analogy with compressed sensing in [CR09} ICPlOal IGro09l INW09| . while an 
alternative approach based on greedy methods was developed in |KMO10allKMO10b[IKM10aj . While 
the present chapter does not treat matrix completion in any detail, it is interesting to mention that 
graphical models ideas can be useful in this context as well. 

Let M G ]R mxn be the matrix to be reconstructed, and assume that a subset E C [m] x [n] of 
its entries is observed. It is natural to try to accomplish this task by minimizing the £2 distance on 
observed entries. For X G R mxr , Y G R nxr , we introduce therefore the cost function 

C(X,Y)= 1 -\\V E (M-XY T )f F (7.7) 
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where Ve is the projector that sets to zero the entries outside E (i.e. VE(L)ij = Lij if (i, j) E E 
and Ve{J-^)%3 = otherwise). If we denote the rows of X as x±, . . . ,x m £ W and the rows in Y as 
j/i,...,j/ n £l r , the above cost function can be rewritten as 

C{X,Y) = \ {Mij- (xi, yj )) 2 , (7.8) 

with ( • , •,) the standard scalar product on M r . This cost function factors accordingly the bipartite 
graph G with vertex sets V\ = [m] and V2 = [n] and edge set E. The cost decomposes as a sum of 
pairwise terms associated with the edges of G. 

Figure [18] reproduces the graph G that is associated to the cost function C(X, Y). It is remarkable 
some properties of the reconstruction problem can be 'read' from the graph. For instance, in the 
simple case r = 1, the matrix M can be reconstructed if and only if G is connected (banning for 
degenerate cases) [KMO08] . For higher values of the rank r, rigidity of the graph is related to 
uniqueness of the solution of the reconstruction problem [SC09j . Finally, message passing algorithms 
for this problem were studied in [KYP10[ IKM10a| . 



7.4 General regressions 

The basic reconstruction method discussed in this review is the regularized least-squares regression 
defined in Eq. (|2.8j) . also known as the LASSO. While this is by far the most interesting setting 
for signal processing applications, for a number of statistical learning problems, the linear model 
(jl.ip is not appropriate. Generalized linear models provide a flexible framework to extend the ideas 
discussed here. 

An important example is logistic regression, which is particularly suited for the case in which the 
measurements y\, ■ ■ ■ y m are 0-1 valued. Within logistic regression, these are modeled as independent 
Bernoulli random variables with 

p(y a = l\x) = , (7.9) 

1 + e A °- x 

with A a a vector of 'features' that characterizes the a-th experiment. The objective is to learn the 
vector x of coefficients that encodes the relevance of each feature. A possible approach consists in 
minimizing the regularized (negative) log-likelihood, that is 

m m 
a=l a=l 

The papers [Ran 10, BM10] develops approximate message passing algorithms for solving optimization 
problems of this type. 
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